
    ConDeTri - A Content Dependent Read Trimmer for Illumina Data

    Over the last few years, DNA and RNA sequencing have come to play an increasingly important role in biological and medical applications, driven by the greater throughput of the new sequencing machines and the enormous decrease in sequencing costs. Illumina/Solexa sequencing in particular has had a growing impact on data gathering from model and non-model organisms. However, accurate and easy-to-use tools for quality filtering have not yet been established. We present ConDeTri, a method for content-dependent read trimming of next-generation sequencing data that uses the quality score of each individual base. The main focus of the method is to remove sequencing errors from reads so that sequencing reads can be standardized; another is to make read trimming easy to incorporate into next-generation sequencing data processing and analysis pipelines. ConDeTri can process single-end and paired-end sequence data of arbitrary length, independent of sequencing coverage and without user interaction. It trims and removes reads with low quality scores, saving computational time and memory during de novo assemblies; low-coverage and large-genome sequencing projects gain especially from trimming. The method can easily be incorporated into preprocessing and analysis pipelines for Illumina data.
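
    As a concrete illustration of per-base quality trimming, the sketch below cuts low-quality bases from the 3' end of a read. It is a minimal toy, assuming Phred scores as a list of integers; ConDeTri's published algorithm additionally uses high/low-quality thresholds and counts of consecutive high-quality bases, which this sketch does not reproduce.

        def trim_3prime(seq, quals, q_min=25, min_len=30):
            """Trim trailing bases whose Phred quality is below q_min.

            Toy per-base quality trimmer (not ConDeTri's algorithm):
            returns the trimmed (seq, quals) pair, or None if the read
            becomes shorter than min_len and should be discarded.
            """
            end = len(seq)
            while end > 0 and quals[end - 1] < q_min:
                end -= 1
            if end < min_len:
                return None
            return seq[:end], quals[:end]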

    SolexaQA: At-a-glance quality assessment of Illumina second-generation sequencing data

    Background: Illumina's second-generation sequencing platform is playing an increasingly prominent role in modern DNA and RNA sequencing efforts. However, rapid, simple, standardized and independent measures of run quality are currently lacking, as are tools to process sequences for use in downstream applications based on read-level quality data.
    Results: We present SolexaQA, a user-friendly software package designed to generate detailed statistics and at-a-glance graphics of sequence data quality both quickly and in an automated fashion. This package contains associated software to trim sequences dynamically using the quality scores of bases within individual reads.
    Conclusion: The SolexaQA package produces standardized outputs within minutes, thus facilitating ready comparison between flow cell lanes and machine runs, as well as providing immediate diagnostic information to guide the manipulation of sequence data for downstream analyses.
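
    The kind of at-a-glance summary such a package produces can be pictured with the short sketch below, which computes the mean Phred quality per sequencing cycle across a set of reads. This is an illustrative toy, not SolexaQA's implementation; it assumes each read's qualities are given as a list of integers.

        from statistics import mean

        def per_cycle_quality(reads):
            """Mean Phred quality per cycle (read position) across reads.

            reads: iterable of quality lists, one per read. Returns a
            list of per-cycle means -- the per-position summary that
            quality-assessment tools typically plot.
            """
            cycles = {}
            for quals in reads:
                for i, q in enumerate(quals):
                    cycles.setdefault(i, []).append(q)
            return [mean(cycles[i]) for i in sorted(cycles)]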

    Evaluation of statistical methods for normalization and differential expression in mRNA-Seq experiments

    Background: High-throughput sequencing technologies, such as the Illumina Genome Analyzer, are powerful new tools for investigating a wide range of biological and medical questions. Statistical and computational methods are key for drawing meaningful and accurate conclusions from the massive and complex datasets generated by the sequencers. We provide a detailed evaluation of statistical methods for normalization and differential expression (DE) analysis of Illumina transcriptome sequencing (mRNA-Seq) data.
    Results: We compare statistical methods for detecting genes that are significantly DE between two types of biological samples and find substantial differences in how the test statistics handle low-count genes. We evaluate how DE results are affected by features of the sequencing platform, such as varying gene lengths, the base-calling calibration method (with and without a phi X control lane), and flow-cell/library preparation effects. We investigate the impact of the read-count normalization method on DE results and show that the standard approach of scaling by total lane counts (e.g., RPKM) can bias estimates of DE. We propose more general quantile-based normalization procedures and demonstrate an improvement in DE detection.
    Conclusions: Our results have significant practical and methodological implications for the design and analysis of mRNA-Seq experiments. They highlight the importance of appropriate statistical methods for normalization and DE inference to account for features of the sequencing platform that could impact the accuracy of results. They also reveal the need for further research into statistical and computational methods for mRNA-Seq.
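
    One quantile-based alternative to total-count scaling is upper-quartile normalization, sketched below: each lane is scaled by the upper quartile of its non-zero gene counts rather than by its total count. A minimal sketch, assuming a genes-by-lanes count matrix; see the paper for the exact procedures evaluated.

        import numpy as np

        def upper_quartile_normalize(counts):
            """Scale each lane by the 75th percentile of its non-zero
            gene counts (upper-quartile normalization).

            counts: genes x lanes array of raw read counts. Factors are
            rescaled to mean 1 so the overall count magnitude is kept.
            """
            uq = np.array([np.percentile(lane[lane > 0], 75)
                           for lane in counts.T])
            factors = uq / uq.mean()
            return counts / factors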

    Inherent Signals in Sequencing-Based Chromatin-ImmunoPrecipitation Control Libraries

    The growth of sequencing-based chromatin immunoprecipitation (ChIP-seq) studies calls for a more in-depth understanding of the nature of the technology and of the resultant data, in order to reduce false positives and false negatives. Control libraries are typically constructed to complement such studies and mitigate the effect of systematic biases that might be present in the data. In this study, we explored multiple control libraries to better understand what they truly represent. First, we analyzed the genome-wide profiles of various sequencing-based libraries at a low resolution of 1 Mbp and compared them with each other as well as against aCGH data. We found that copy number has a major influence on both ChIP-enriched and control libraries. We then inspected repeat regions to assess the extent of mapping bias. Next, significantly tag-rich 5 kbp regions were identified and associated with various genomic landmarks; for instance, we discovered that gene boundaries were surprisingly enriched with sequenced tags. Furthermore, profiles differed noticeably between cell types, even though the cell types examined were related. Overall, we found that control libraries bear traces of systematic biases attributable to genomic copy number, inherent sequencing bias, mapping ambiguity, and cell-type-specific chromatin structure. Our results suggest that careful analysis of control libraries can reveal promising biological insights.
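
    The low-resolution comparison described above amounts to counting mapped tags in fixed-width genomic windows. A minimal sketch of that binning step, assuming mapped-tag coordinates as (chromosome, position) pairs:

        from collections import Counter

        def bin_tags(tag_positions, bin_size=1_000_000):
            """Count mapped tags per fixed-width genomic bin (1 Mbp by
            default, matching the study's low-resolution profiles).

            tag_positions: iterable of (chromosome, position) pairs.
            Returns a Counter keyed by (chromosome, bin_index), ready to
            be compared across libraries or against aCGH profiles.
            """
            counts = Counter()
            for chrom, pos in tag_positions:
                counts[(chrom, pos // bin_size)] += 1
            return counts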

    Evaluating the Fidelity of De Novo Short Read Metagenomic Assembly Using Simulated Data

    A frequent step in metagenomic data analysis is the assembly of the sequenced reads. Many assembly tools targeting data from next-generation sequencing (NGS) technologies have been published in recent years, but these assemblers were neither designed for nor tested in the multi-genome scenarios that characterize metagenomic studies. Here we provide a critical assessment of current de novo short-read assembly tools in multi-genome scenarios using complex simulated metagenomic data. With this approach we tested the fidelity of different assemblers in metagenomic studies, demonstrating that even under the simplest community compositions the number of chimeric contigs involving different species is noticeable. We further show that the assembly process reduces the accuracy of the functional classification of the metagenomic data, and that these errors can be overcome by raising the coverage of the studied metagenome. The results presented here highlight the particular difficulties that de novo genome assemblers face in multi-genome scenarios, demonstrating that these difficulties, which often compromise the functional classification of the analyzed data, can be overcome with a high sequencing effort.
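
    With simulated data the chimera check is straightforward, because the source genome of every read is known. The sketch below flags contigs assembled from reads of more than one species; the data structures (contig-to-read and read-to-species mappings) are assumptions for illustration, not the paper's actual evaluation code.

        def find_chimeric_contigs(contig_reads, read_species):
            """Flag contigs built from reads of more than one genome.

            contig_reads: {contig_id: [read_id, ...]} from the assembler.
            read_species: {read_id: species} truth labels from the read
            simulator. Returns {contig_id: species_set} for every
            chimeric contig.
            """
            chimeras = {}
            for contig, reads in contig_reads.items():
                species = {read_species[r] for r in reads}
                if len(species) > 1:
                    chimeras[contig] = species
            return chimeras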

    Integrating Sequencing Technologies in Personal Genomics: Optimal Low Cost Reconstruction of Structural Variants

    The goal of human genome re-sequencing is to obtain an accurate assembly of an individual's genome. Recently, there has been great excitement about the many technologies being developed for this purpose (e.g. medium- and short-read sequencing from companies such as 454 and SOLiD, and high-density oligo-arrays from Affymetrix and NimbleGen), with even more expected to appear. The costs and sensitivities of these technologies differ considerably from each other. As an important goal of personal genomics is to reduce the cost of re-sequencing to an affordable point, it is worthwhile to consider how to integrate technologies optimally. Here, we build a simulation toolbox to help optimally combine different technologies for genome re-sequencing, especially for reconstructing large structural variants (SVs). SV reconstruction is considered the most challenging step in human genome re-sequencing; it is sometimes even harder than de novo assembly of small genomes because of the duplications and repetitive sequences in the human genome. To this end, we formulate canonical problems that are representative of issues in reconstruction and are of small enough scale to be computationally tractable and simulatable. Using semi-realistic simulations, we show how different technologies can be combined to solve the assembly optimally at low cost. With mappability maps, our simulations efficiently handle the inhomogeneous repeat-containing structure of the human genome and the computational complexity of practical assembly algorithms. They quantitatively show how combining different read lengths is more cost-effective than using a single length, how an optimal mixed sequencing strategy for reconstructing large novel SVs usually also gives accurate detection of SNPs/indels, how paired-end reads can improve reconstruction efficiency, and how adding arrays is more efficient than sequencing alone for disentangling some complex SVs. Our strategy should facilitate the sequencing of human genomes at maximum accuracy and low cost.
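
    The cost side of such an optimization can be pictured with a toy model: price each technology per megabase and ask what a mixed strategy costs at given coverages. The technology names and per-megabase prices below are hypothetical placeholders, not figures from the paper.

        def strategy_cost(mix, price_per_mb, genome_mb=3200):
            """Total cost of a mixed sequencing strategy (toy model).

            mix: fold coverage per technology, e.g. {"long": 3, "short": 20}.
            price_per_mb: hypothetical price per megabase sequenced for
            each technology. genome_mb: genome size in megabases.
            """
            return sum(cov * genome_mb * price_per_mb[tech]
                       for tech, cov in mix.items())

        # Example: compare a mixed strategy against short reads alone,
        # with made-up prices (long reads costlier per base).
        prices = {"long": 0.60, "short": 0.05}
        mixed = strategy_cost({"long": 3, "short": 20}, prices)
        short_only = strategy_cost({"short": 35}, prices)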

    Fast Mapping of Short Sequences with Mismatches, Insertions and Deletions Using Index Structures

    With few exceptions, current methods for short-read mapping use simple seed heuristics to speed up the search, and most of the underlying matching models neglect the need to allow not only mismatches but also insertions and deletions. Current evaluations indicate, however, that very different error models apply to the novel high-throughput sequencing methods: while mismatches are the most frequent error type in Illumina reads, reads produced by 454's GS FLX predominantly contain insertions and deletions (indels). Even though 454 sequencers are able to produce longer reads, the method is frequently applied to small RNA (miRNA and siRNA) sequencing. Fast and accurate matching, in particular of short reads with diverse errors, is therefore a pressing practical problem. We introduce a matching model for short reads that can cope with indels as well as mismatches, and that accommodates different error models: for example, it can handle leading and trailing contaminations caused by primers and poly-A tails in transcriptomics, as well as the length-dependent increase of error rates. In these contexts it thus simplifies the tedious and error-prone trimming step. For efficient searches, our method utilizes index structures in the form of enhanced suffix arrays. In a comparison with current methods for short-read mapping, the presented approach shows significantly increased performance not only for 454 reads but also for Illumina reads. Our approach is implemented in the software segemehl, available at http://www.bioinf.uni-leipzig.de/Software/segemehl/.
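
    The error model being described (mismatches plus indels) corresponds to edit distance under semi-global alignment, which the short dynamic program below computes for a read against a reference window. This is only the naive verification step, shown for illustration; segemehl's speed comes from its enhanced-suffix-array index, which the sketch does not attempt to reproduce.

        def semiglobal_edit_distance(read, ref):
            """Minimum edits (mismatches + indels) to align the whole
            read against any substring of ref.

            Row 0 is all zeros so the alignment may start anywhere in
            ref; taking the minimum of the last row lets it end
            anywhere. O(len(read) * len(ref)) time, O(len(ref)) space.
            """
            prev = [0] * (len(ref) + 1)
            for i, r in enumerate(read, 1):
                curr = [i]  # consuming read[:i] before ref costs i indels
                for j, c in enumerate(ref, 1):
                    curr.append(min(prev[j] + 1,               # deletion
                                    curr[j - 1] + 1,           # insertion
                                    prev[j - 1] + (r != c)))   # (mis)match
                prev = curr
            return min(prev)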

    WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads

    Gerlach W, Jünemann S, Tille F, Goesmann A, Stoye J. WebCARMA: a web application for the functional and taxonomic classification of unassembled metagenomic reads. BMC Bioinformatics. 2009;10(1):430.
    Background: Metagenomics is a new field of research on natural microbial communities. High-throughput sequencing techniques like 454 or Solexa-Illumina promise new possibilities, as they are able to produce huge amounts of data in much shorter time and with less effort and cost than the traditional Sanger technique. However, the data produced come in even shorter reads (35-100 base pairs with Illumina, 100-500 base pairs with 454 sequencing). CARMA is a new software pipeline for characterizing the species composition and genetic potential of microbial samples using short, unassembled reads.
    Results: In this paper, we introduce WebCARMA, a refined version of CARMA available as a web application for the taxonomic and functional classification of unassembled (ultra-)short reads from metagenomic communities. In addition, we have analysed the applicability of ultra-short reads in metagenomics.
    Conclusions: We show that unassembled reads as short as 35 bp can be used for the taxonomic classification of a metagenome. The web application is freely available at http://webcarma.cebitec.uni-bielefeld.de.
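
    To make "taxonomic classification of unassembled reads" concrete, the sketch below assigns a read to the lowest common ancestor of the lineages of its database hits. This is a generic LCA-style toy for illustration only; CARMA itself classifies reads via Pfam protein families rather than this scheme.

        def lca_assign(hit_lineages):
            """Assign a read to the deepest taxon shared by all of its
            hits' lineages (root-to-leaf lists such as
            ["Bacteria", "Proteobacteria", ...]). Returns None when the
            read has no hits or the lineages share no taxon.
            """
            assignment = None
            for ranks in zip(*hit_lineages):
                if all(r == ranks[0] for r in ranks):
                    assignment = ranks[0]
                else:
                    break
            return assignment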

    Investigation into the annotation of protocol sequencing steps in the sequence read archive

    Background: The workflow for producing high-throughput sequencing data from nucleic acid samples is complex. A series of protocol steps must be followed in preparing samples for next-generation sequencing, and the bias introduced in several of these steps, namely DNA fractionation, blunting, phosphorylation, adapter ligation and library enrichment, remains to be quantified.
    Results: We examined the experimental metadata of the public repository Sequence Read Archive (SRA) to ascertain the level of annotation of important sequencing steps in submissions to the database. Using SQL relational database queries (against the SRAdb SQLite database generated by the Bioconductor consortium) to search for keywords commonly occurring in key preparatory protocol steps, partitioned over studies, we found that 7.10%, 5.84% and 7.57% of all records (for fragmentation, ligation and enrichment, respectively) had at least one keyword corresponding to one of the three protocol steps. Only 4.06% of all records, partitioned over studies, had keywords for all three steps in the protocol (5.58% of all SRA records).
    Conclusions: The current level of annotation in the SRA inhibits systematic studies of bias due to these protocol steps. Downstream from this, meta-analyses and comparative studies based on these data will have a source of bias that cannot be quantified at present.
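
    A keyword search of this kind can be run directly against the SRAdb SQLite file. The sketch below assumes SRAdb's experiment table and its library_construction_protocol column; the keyword stems are illustrative stand-ins, not the study's exact search terms.

        import sqlite3

        # Illustrative keyword stems per protocol step (assumptions,
        # not the study's exact terms).
        KEYWORDS = {
            "fragmentation": ["fragment", "sonicat", "shear"],
            "ligation": ["ligat", "adapter"],
            "enrichment": ["enrich", "amplif", "PCR"],
        }

        def annotation_rates(db_path="SRAmetadb.sqlite"):
            """Fraction of SRA experiment records whose free-text
            protocol field mentions each protocol step."""
            con = sqlite3.connect(db_path)
            total = con.execute("SELECT COUNT(*) FROM experiment").fetchone()[0]
            rates = {}
            for step, stems in KEYWORDS.items():
                clause = " OR ".join(
                    "library_construction_protocol LIKE ?" for _ in stems)
                n = con.execute(
                    "SELECT COUNT(*) FROM experiment WHERE " + clause,
                    ["%" + s + "%" for s in stems]).fetchone()[0]
                rates[step] = n / total
            con.close()
            return rates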